Human Genetics — Latest Matching Preprints

1

Investigating the Y chromosome in complex disease: Phenome-wide scan across 104,334 Finnish men

Preussner, A.; Leinonen, J. T.; FinnGen; Pirinen, M.; Tukiainen, T.

2026-06-10 genetic and genomic medicine 10.64898/2026.06.09.26355235 medRxiv

Top 0.1%

6.2%

Although the Y chromosome represents roughly 2% of the male genome, it is often ignored in genome-wide association studies (GWAS). Consequently, the potential health impacts of Y-chromosomal genetic variation remain incompletely understood. To fill this gap, we performed a phenome-wide association study (PheWAS) in FinnGen across 1,426 binary and quantitative traits using Y-chromosomal variation (frequency ≥ 1%) in 104,334 genotyped men. As Y chromosome variation is prone to population stratification, we performed carefully adjusted association analyses and further examined these through kin-based validation in 19,275 female and 24,712 male first-degree relatives. We found 121 suggestive (p < 5.6×10⁻³) phenotypic associations in the Y chromosome, yet none of these were strong enough to reach phenome-wide significance (p < 3.9×10⁻⁶). While only 38 associations were supported in the kin-based validation, intriguingly we found support for a previously suggested link between haplogroup I1 and coronary heart disease (CHD; OR=1.06, 95% CI=1.02-1.11, p=3.7×10⁻³; male validation OR=1.05; female validation OR=0.97). The I1-CHD association was detected across distinct geographical areas within Finland and was independent of loss of Y (LOY) and of autosomal risk of CHD, suggesting a link between germline Y-chromosomal variation and heart disease risk. Overall, this study presents a comprehensive phenome-wide analysis of Y-chromosomal associations, highlighting the potential relevance of Y-chromosomal variation beyond sex determination. Our findings further emphasize the need for improved capture of Y-chromosomal variants and further analyses in biobank-scale data to allow for deeper exploration of the male-specific genetic architecture of complex diseases.

2

FRMPD4, a causal gene for intellectual disability and epilepsy, is associated with X-linked non-syndromic hearing loss

Liedtke, D.; Rak, K.; Schrode, K. M.; Hehlert, P.; Chamanrou, N.; Bengl, D.; Katana, R.; Heydaran, S.; Doll, J.; Han, M.; Nanda, I.; Senthilan, P. R.; Juergens, L.; Bieniussa, L.; Voelker, J.; Neuner, C.; Hofrichter, M. A.; Schroeder, J.; Schellens, R. T.; de Vrieze, E.; van Wijk, E.; Zechner, U.; Herms, S.; Hoffmann, P.; Mueller, T.; Dittrich, M.; Bartsch, O.; Krawitz, P. M.; Klopocki, E.; Shehata-Dieler, W.; Maroofian, R.; Wang, T.; Worley, P. F.; Goepfert, M. C.; Galehdari, H.; Lauer, A. M.; Haaf, T.; Vona, B.

2026-03-30 genetic and genomic medicine 10.64898/2026.03.27.26349271 medRxiv

Top 0.1%

3.7%

Background: Understanding the phenotypic spectrum of disease-associated genes is essential for accurate diagnosis and targeted therapy. FRMPD4 (FERM and PDZ Domain Containing 4) has previously been associated with intellectual disability and epilepsy. However, its potential role in non-syndromic hearing loss has not been explored. Methods: We performed genetic analysis in two unrelated families presenting with non-syndromic sensorineural hearing loss, identifying maternally inherited missense variants in FRMPD4. Clinical phenotyping included audiological assessment and evaluation for neurodevelopmental involvement. Cross-species expression analyses were conducted in Drosophila, zebrafish, and mouse. Functional characterization included quantitative evaluation of sound-evoked responses in Drosophila nicht gut hoerend (ngh) mutants, assessment of neuronal development and acoustic startle responses in zebrafish loss-of-function models, and morphological cochlear analyses with auditory brainstem response measurements in knockout mice. Results: Three affected males from two unrelated families presented with prelingual, bilaterally symmetrical sensorineural hearing loss, with confirmed congenital onset in one individual and no evidence of neurodevelopmental abnormalities. Cross-species analyses demonstrated evolutionarily conserved expression of FRMPD4 in auditory structures. In Drosophila, quantitative analysis of sound-evoked responses in ngh mutants revealed impaired auditory function. Zebrafish loss-of-function models exhibited reduced neuronal populations in the otic vesicle and posterior lateral line, abnormal neuromast development, and diminished acoustic startle responses. In mice, Frmpd4 knockout resulted in high-frequency hearing loss and cochlear abnormalities consistent with the human phenotype. Conclusions: Our findings expand the phenotypic spectrum of FRMPD4 to include non-syndromic sensorineural hearing loss and establish its evolutionarily conserved role in auditory function. These results have direct implications for genetic diagnosis and variant interpretation in patients with hearing loss.

3

A pilot genome-wide association study of ischemic heart disease with co-occurring arterial hypertension in a Kazakh cohort

Skvortsova, L.; Yergali, K.; Zhaxylykova, A.; Begmanova, M.; Mansharipova, A.

2026-03-23 genetic and genomic medicine 10.64898/2026.03.19.26348868 medRxiv

Top 0.1%

3.7%

Genome-wide association studies (GWAS) of ischemic heart disease (IHD) remain underrepresented in Central Asian populations. We conducted a pilot GWAS of IHD with co-occurring arterial hypertension in a Kazakh cohort to identify candidate loci for future replication. A case-control GWAS was performed in 451 individuals (236 cases and 215 controls). Genotyping was conducted using the Illumina Infinium Global Screening Array-24 v3.0. Association testing was performed using logistic regression under an additive genetic model adjusted for age, sex and the first ten principal components (PC1-PC10). Multiple testing correction was applied using the Bonferroni adjustment. As an additional analysis, knowledge-guided GWAS (KGWAS) followed by MAGMA gene-based testing was used to prioritize candidate genes. After quality control, 345,371 variants were tested. Two loci surpassed the Bonferroni-corrected genome-wide significance threshold: rs28898595 at the UGT1A locus (effect allele C; OR = 0.33, 95% CI = 0.23-0.49; p = 3.01×10⁻⁸) and rs28709059 in an intronic region of the ACTR3C gene (effect allele C; OR = 0.4, 95% CI = 0.29-0.55; p = 4.08×10⁻⁸). Several additional loci showed suggestive evidence of association. In gene-level analysis, the CSMD1 gene demonstrated a significant association signal in MAGMA that was consistent across the European (p = 1.16×10⁻¹¹) and East Asian (p = 9.07×10⁻¹¹) LD reference panels. This pilot study identifies genome-wide significant loci (UGT1A, ACTR3C) and supports CSMD1 as a prioritized candidate gene for the complex phenotype of IHD with co-occurring arterial hypertension in the Kazakh cohort. These findings are preliminary and require replication in larger Central Asian cohorts and further functional validation.
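
As a rough check on the threshold described in this abstract, the sketch below computes a Bonferroni-corrected cutoff for the 345,371 tested variants and compares the two reported p-values against it. The family-wise alpha of 0.05 is an assumption (the abstract does not state it), and the function and variable names are illustrative only.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance cutoff under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Number of variants tested after quality control (from the abstract).
N_VARIANTS = 345_371

# p-values reported in the abstract for the two lead variants.
REPORTED_P = {"rs28898595": 3.01e-8, "rs28709059": 4.08e-8}

# alpha = 0.05 is assumed here; the abstract does not give it.
threshold = bonferroni_threshold(N_VARIANTS)
significant = {rsid: p < threshold for rsid, p in REPORTED_P.items()}

print(f"Bonferroni threshold: {threshold:.3e}")
for rsid, passed in significant.items():
    print(rsid, "passes" if passed else "fails")
```

Under that assumed alpha the cutoff is about 1.45×10⁻⁷, which is less strict than the conventional 5×10⁻⁸ genome-wide threshold; both reported p-values fall below either value.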

4

Proteogenomic analysis of 5,411 plasma proteins in sickle cell disease patients

Groza, C.; Chignon, A.; Lo, K. S.; Bellegarde, V.; Bartolucci, P.; Lettre, G.

2026-04-07 genetic and genomic medicine 10.64898/2026.04.06.26350255 medRxiv

Top 0.1%

3.6%

There are few therapeutic options to treat patients with sickle cell disease (SCD), a blood disorder caused by mutations in the β-globin gene that affects >7M individuals worldwide. Combining human genetics and high-throughput proteomics can help identify new drug targets. Here, we present results from a proteogenomic analysis of the plasma proteome in SCD patients. We measured the levels of 5,411 plasma proteins and tested their associations with common genetic variation in 343 SCD patients. After conditional analyses, we identified 560 protein quantitative trait loci (pQTL), including 58 (10%) that are novel. Many of these pQTL are not specific to SCD patients and associate with clinically relevant traits in non-SCD African Americans from the Million Veteran Program (e.g. hemoglobin concentration, triglycerides). The effect sizes of the pQTL are largely concordant between SCD and non-SCD individuals, although we found examples (e.g. APOL1, haptoglobin) with evidence of heterogeneity that suggests an interaction between the plasma proteome and the SCD genotype. Finally, we combine pQTL and genome-wide association study results for fetal hemoglobin (HbF) in a Mendelian randomization analysis to prioritize five proteins that may increase HbF production (ENPP5, LBP, NAAA, PT3X, ZP3).

5

UshEffect-3D: Structure-informed Classification of USH2A Missense Variants for Inherited Retinal Disease

Choudhary, D.; Portelli, S.; Ascher, D. B.

2026-04-27 bioinformatics 10.64898/2026.04.23.720479 medRxiv

Top 0.1%

3.6%

Purpose: Variants of uncertain significance (VUS) in USH2A represent a critical interpretive challenge in inherited retinal disease, with over 70% of ClinVar submissions for this gene currently unresolved. We aimed to develop a gene-specific, structure-informed machine learning framework to improve the clinical classification of USH2A missense variants and provide a tractable tool to aid the diagnosis of Usher Syndrome II. Methods: A dataset of 545 curated USH2A missense variants with established clinical classifications was assembled from ClinVar and LOVD. AlphaFold2-predicted domain structures were used to generate local structural descriptors and biochemical features combined with sequence-based evolutionary conservation scores, yielding 153 candidate features reduced to nine via sequential feature selection. Eleven machine learning classifiers were trained using a 10-fold cross-validation strategy, then independently assessed on a blind test set and validated against 78 ACMG-classified pathogenic variants. Model predictions were benchmarked against five general-purpose variant effect predictors and applied to 2639 USH2A VUS from ClinVar. Feature contributions were analysed using SHAP analysis and ablation studies. Results: The Random Forest classifier achieved the highest performance on the blind test set, with an MCC of 0.87 and AUC of 0.97. On independent ACMG validation, sensitivity reached 0.73 with perfect precision. UshEffect-3D substantially outperformed all general-purpose predictors, including PolyPhen-2 (MCC = 0.61), AlphaMissense (MCC = 0.42), and ESM-1b (MCC = 0.32). SHAP analysis identified evolutionary conservation as a dominant predictor, with structural stability providing an independent but complementary signal. Applied to 2639 ClinVar VUS, the model prioritised 888 variants (33.6%) as likely pathogenic, particularly enriched within the Laminin N-terminal and Laminin G-like domains. Conclusions: UshEffect-3D demonstrates that gene-specific, structure-informed machine learning substantially outperforms general-purpose variant effect predictors for USH2A missense variant interpretation. This framework provides a high-confidence prioritization resource for the large unresolved VUS burden in this gene to facilitate earlier molecular resolution of USH2A-associated disease. As gene-directed therapies for USH2A-associated retinal disease advance toward clinical application, accurate and interpretable variant classification will be essential for equitable patient selection. UshEffect-3D is freely accessible via an interactive web server.
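
The headline metric in this abstract is the Matthews correlation coefficient (MCC). For readers less familiar with it, here is a minimal sketch of how MCC is computed from a binary confusion matrix. The counts are hypothetical and are not taken from the study; the function name is illustrative.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical blind-test counts, chosen only for illustration.
example_mcc = mcc(tp=45, tn=50, fp=2, fn=3)
perfect_mcc = mcc(tp=10, tn=10, fp=0, fn=0)

print(f"example MCC: {example_mcc:.3f}")
print(f"perfect classifier MCC: {perfect_mcc:.1f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when pathogenic and benign classes are imbalanced.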

6

Structural distance at the tRNA synthetase active site interface predicts pathogenicity but is captured by AlphaMissense and EVE except among score-ambiguous variants

Liebeskind, K.; Francklyn, C.; Barrantes Reynolds, R.

2026-05-26 bioinformatics 10.64898/2026.05.22.727252 medRxiv

Top 0.1%

3.1%

Variants of uncertain significance have accumulated as genomic sequencing has become more widespread, which complicates rare disease diagnosis and requires substantial resources for re-evaluation. Aminoacyl-tRNA synthetases (ARSs) are a protein family with extensive variant data and well-characterized disease associations, making them an ideal system for investigating the relationship between variant location and pathogenicity. Using structural distance measurements to the ARS-tRNA binding interface combined with existing pathogenicity predictors, AlphaMissense and EVE, we investigated whether explicit structural binding information could improve missense variant pathogenicity prediction. Pathogenic variants were found to cluster significantly closer to the tRNA-binding interface than benign variants (p = 0.0003). Incorporating explicit distance information into a Bayesian mixture model did not substantially improve predictive performance over AlphaMissense and EVE alone, suggesting that these models already implicitly capture relevant structural binding context. However, a clinically important subset of interface variants classified as ambiguous by both existing models identifies a specific gap where explicit structural distance information may provide added discriminative value, but the limited number of clinically validated variants currently available constrains the ability to fully evaluate this potential. Incorporating additional biologically relevant features not captured by existing models, such as protein stability or conformational dynamics, as well as refining structural distance calculations, may further improve classification of this subset. These findings highlight both the power and the limitations of existing pathogenicity predictors and suggest that structurally informed approaches targeting the binding interface represent a promising direction for improving classification of these ambiguous variants that have great clinical significance. 
Author Summary: Advances in clinical genetic sequencing have led to the identification of a growing number of genetic variants whose impact on human health is unknown. These "variants of uncertain significance" present a major challenge because their role in causing disease cannot yet be confirmed or ruled out. This study focuses on a specific family of essential enzymes called aminoacyl-tRNA synthetases, which play a critical role in protein translation. Mutations in these enzymes have been linked to a range of diseases. This project aims to provide a novel method for determining pathogenicity of variants specifically in aminoacyl-tRNA synthetases. We propose that physical proximity of a variant to the functional binding site of the enzyme is influential in determining pathogenicity. We find that this spatial relationship is a meaningful indicator of a variant's potential to disrupt normal function.

7

PALM3 and hearing loss: a potential dual diagnosis interfering with novel gene discovery

Najarzadeh Torbati, P.; Hallbrucker, L.; Hofrichter, M. A. H.; Owrang, D.; Setzke, J.; Kilimann, M. W.; Hemmatpour, A.; Rajati, M.; Ghayoor Karimiani, E.; Haaf, T.; Vogl, C.; Vona, B.

2026-04-21 genetic and genomic medicine 10.64898/2026.04.20.26351093 medRxiv

Top 0.1%

3.0%

Hereditary hearing loss is highly genetically heterogeneous, with emerging overlap between genes implicated in early-onset and age-related hearing loss. We report a consanguineous family with autosomal recessive, non-syndromic hearing loss in which the proband harbors a homozygous splice-site variant in PALM3 (NM_001145028.2:c.314+1G>A) and a homozygous missense variant in OTOA. A minigene assay for the PALM3 variant demonstrated aberrant splicing with exon skipping, resulting in a frameshift and a large in-frame deletion, both consistent with loss of function and impacting all known transcripts. While the organ of Corti from 12-month-old heterozygous Palm3 mice showed preserved overall architecture, published Palm3 knockout mice exhibit auditory dysfunction, supporting an auditory phenotype with loss of function. Although a dual molecular diagnosis cannot be excluded, the combined genetic, functional, and comparative data support PALM3 as a strong candidate gene for autosomal recessive hearing loss.

8

Assessing the clinical significance of a novel rare variant in Loeys-Dietz Syndrome by combining AI-driven modelling and cell biology

Boukrout, N.; Delage, C.; Comptdaer, T.; Arondal, W.; Jemel, A.; Azabou, N.; Bousnina, M.; Mallouki, M.; Sabaouni, N.; Arbi, R.; Kchaou, S.; Ammar, H.; Hantous-Zannad, S.; Jilani, H.; Elaribi, Y.; Benjemaa, L.; Van der Hauwaert, C.; Larrue, R.; Cheok, M.; Perrais, M.; Lefebvre, B.; Cauffiez, C.; Pottier, N.

2026-03-31 genetic and genomic medicine 10.64898/2026.03.30.26349510 medRxiv

Top 0.1%

2.5%

Loeys-Dietz syndrome (LDS) is an autosomal dominant connective-tissue disorder caused by genetic variants in TGF-β pathway genes, most often TGFBR1/2. While pathogenic TGFBR2 variants usually cluster in the kinase domain and disrupt SMAD signalling, confidently distinguishing variants that affect TGFBR2 function from rare benign alterations remains one of the most important ongoing challenges for accurate genetic testing. Therefore, there is a pressing need to develop methods that can improve functional variant interpretation. Here, we describe and characterize the functional impact of a novel genetic variant in the TGFBR2 kinase domain (E431K), in a patient with a clinical diagnosis of syndromic genetic aortopathy. We assessed the structural and functional consequences of this variant using AI-driven molecular modelling and in vitro cell-based assays. A high-quality homology-based model of TGFBR2 was generated, and computational mutagenesis based on the structural context and evolutionary conservation was used to forecast variant pathogenicity. Relative to wild type, the variant affects protein stability by disrupting intramolecular interactions and likely induces conformational changes that may affect kinase activity and thus TGF-β signalling. This was experimentally confirmed by showing abnormal protein levels and altered canonical TGF-β pathway activation. Overall, our results establish that the E431K variant leads to aberrant TGF-β signalling and confirm the diagnosis of Loeys-Dietz syndrome type 2 in this patient.

9

Calibration of in-frame indel variant effect predictors for clinical variant classification

Abderrazzaq, H.; Singh, M.; Babb, L.; Bergquist, T.; Brenner, S. E.; Pejaver, V.; O'Donnell-Luria, A.; Radivojac, P.; ClinGen Computational Working Group; ClinGen Variant Classification Working Group

2026-04-18 bioinformatics 10.64898/2026.04.15.718599 medRxiv

Top 0.1%

2.4%

Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤ 50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, both overall and for insertions and deletions separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.

10

Comprehensive analysis of de novo variants across 2,497 orofacial cleft trios reveals novel genetic drivers of disease

Kurtas, N. E.; Sanchis-Juan, A.; Shin, E.; Curtis, S. W.; Robinson, K. R.; Lee, A. S.; Alade, A. A.; Zhao, X.; Fu, J.; Diaz Perez, K. K.; Gowans, J. J. L.; Eshete, M. A.; Adeyemo, W. L.; Buxo, C. J.; Padilla, C. D.; Poletta, F. A.; Carreno Torres, A.; Wehby, G. L.; Hecht, J. T.; Moreno Uribe, L. M.; Mukhopadhyay, N.; Shaffer, J. R.; Weinberg, S. M.; Murray, J. C.; Beaty, T. H.; Butali, A.; Talkowski, M.; Marazita, M. L.; Leslie-Clarkson, E. J.; Brand, H.

2026-05-24 genetic and genomic medicine 10.64898/2026.05.21.26352934 medRxiv

Top 0.1%

2.1%

Background: Orofacial clefts (OFCs) and other palate abnormalities (PAs) are among the most common birth defects worldwide and are characterized by the abnormal formation of the lip and/or palate. Genetic studies have traditionally classified OFC cases as either syndromic, involving OFCs alongside other congenital anomalies, or nonsyndromic, which represent the majority of cases and occur in isolation. Emerging genomic evidence indicates that genes traditionally associated with syndromic forms of OFC can also harbor variants contributing to isolated cases, challenging the notion of a strict dichotomy between these categories and supporting their integration for gene discovery. Methods: In this study, we applied multiple analytic approaches to characterize the genetic architecture of OFC and PAs by integrating genomic data from 2,497 trios with an OFC-affected (n = 2,080) or PA-affected (n = 417) proband. We compared these findings across OFC subtypes and syndromic status with those from 5,515 control trios to identify enriched biological pathways and mechanisms and to prioritize candidate genes using variant burden testing. Results: We observed a significant enrichment of de novo protein-truncating and damaging missense variants in cases compared to controls (OR = 2.17, p = 1.21×10⁻³²), with particularly strong signals in biologically relevant gene sets involving OFC-associated, constrained, Mendelian disorder, and mouse candidate genes. Variant burden testing identified 39 OFC risk genes at FDR ≤ 0.05, which we then integrated with 593 established OFC genes to interrogate the functional underpinnings of OFC via network analysis. This analysis revealed 309 high-order interactor genes not previously associated with OFC. Notably, this OFC network clustered into ten distinct biological pathways, with nucleosome-associated genes showing significant enrichment among cases in our cohort (OR = 14.8, p = 8.1×10⁻⁴). In a final integrative step, we combined evidence across all analyses to nominate 231 candidate genes, 32 of which contained at least two deleterious de novo variants in our cohort. Conclusions: These findings underscore the value of integrating diverse OFC and PA subtypes, syndromic status, and variant classes to refine the genetic architecture of these disorders, highlighting both phenotypic expansion of known disease genes and the emergence of novel gene-phenotype associations.

11

Evo 2 Predicts Cardiomyopathy-Associated Variants and Elucidates Their Underlying Mechanisms

Kurozumi, A.; Otsuka, N.; Masamichi, I.; Kawakami, T.; Isagawa, T.; Kodera, S.; Takeda, N.

2026-05-17 genomics 10.64898/2026.05.15.725304 medRxiv

Top 0.1%

2.1%

Background: Although advances in next-generation sequencing have accelerated the identification of genetic variants in cardiomyopathy, interpreting variants of uncertain significance (VUS) remains a clinical challenge. Evo 2 is a high-resolution genomic artificial intelligence model capable of predicting pathogenicity across large sequence contexts and enabling mechanistic interpretation; however, its application in cardiovascular genetics is limited. Here, we evaluated the utility of Evo 2 for assessing the pathogenicity and underlying mechanisms of cardiomyopathy-associated variants. Methods: We used Evo 2 to predict the pathogenicity of single-nucleotide variants in cardiomyopathy-related genes listed on ClinVar. We assessed the ability of the model to identify characteristic structural features in both coding and noncoding regions using internal representations such as embeddings, and to infer the molecular mechanisms of variants within these regions. Results: Evo 2 demonstrated high predictive accuracy for pathogenicity, achieving an AUROC of 0.983 and an AUPRC of 0.915. Notably, sparse autoencoders (SAEs) from embeddings identified features corresponding to higher-order structural features, including coiled-coil and actin-binding domains characteristic of cardiomyopathy-related proteins, and accurately detected mutations known to disrupt these domains. The model recognized the binding motif of the cardiac-enriched transcription factor TBX5 with SAEs and accurately predicted a single-nucleotide polymorphism affecting TBX5 binding affinity after supervised fine-tuning. Conclusions: Evo 2 demonstrated strong performance for both predicting pathogenicity and extracting biological features of cardiomyopathy-associated variants. It may represent a powerful emerging tool for evaluating VUS in cardiovascular medicine.

12

Polygenic risk scores enhance the identification of carriers of monogenic forms of idiopathic pulmonary fibrosis

Alonso-Gonzalez, A.; Jaspez, D.; Lorenzo-Salazar, J. M.; Delgado, A.; Quintero-Bacallado, A.; Ma, S.-F.; Strickland, E.; Mychaleckyj, J.; Kim, J. S.; Huang, Y.; Adegunsoye, A.; Oldham, J. M.; Maher, T. M.; Guillen-Guio, B.; Wain, L. V.; Allen, R. J.; Saini, G.; Jenkins, R. G.; Molina-Molina, M.; Zhang, D.; Kim Garcia, C.; Martinez, F. J.; Noth, I.; Flores, C.

2026-04-18 genetic and genomic medicine 10.64898/2026.04.16.26350967 medRxiv

Top 0.1%

1.9%

Background: Idiopathic pulmonary fibrosis (IPF) is a rare disease with a poor prognosis. Disease risk involves rare and common genetic variants. However, an inverse association has been described between them. Accordingly, IPF patients with a higher polygenic risk score (PRS) for IPF are less likely to carry rare deleterious variants and vice versa. Here, we evaluate whether the PRS for IPF could serve as an additional criterion for patient prioritisation in rare variant discovery. Methods: We identified carriers based on the presence of rare qualifying variants (QVs) in genes linked to monogenic forms of pulmonary fibrosis in 888 IPF patients from the Pulmonary Fibrosis Foundation Patient Registry (PFF-PR). Genome-wide association study (GWAS) summary statistics from independent cohorts were used to construct a whole-genome PRS (WG-PRS) using a clumping and thresholding method (C+T) and a Bayesian method (SBayesRC). PRS were also derived from 19 known common sentinel IPF variants (Sentinel-PRS). Logistic regression models were used to evaluate associations between PRS and carrier status. Discriminatory performance was evaluated using area under the curve (AUC) analysis, and comparisons were made with DeLong's test. Validation was performed in 472 IPF individuals from the UK PROFILE cohort. Results: IPF-PRS were strongly associated with QV carrier status: Odds Ratio [OR] 0.65 (95% Confidence Interval [CI] 0.53-0.79) for WG-PRS (C+T), OR 0.71 (95% CI 0.59-0.86) for WG-PRS (SBayesRC), and OR 0.77 (95% CI 0.63-0.94) for Sentinel-PRS. Adding WG-PRS to the patient's personal clinical history improved the prediction of QV carriers: AUC=0.62 for the clinical model, AUC=0.68 for WG-PRS (C+T) (DeLong's test, p=9.54×10⁻⁴) and AUC=0.66 for WG-PRS (SBayesRC) (DeLong's test, p=0.02). Adding IPF-PRS to clinical variables correctly reclassified 22.8% of carriers when using WG-PRS (C+T), 20.8% when using Sentinel-PRS, and 16.7% when using WG-PRS (SBayesRC). WG-PRS (SBayesRC) and the Sentinel-PRS also demonstrated improved prediction of QV carriers in telomere-related genes in PROFILE. Conclusions: Incorporating IPF-PRS into a model based on the patient's clinical history improves the identification of QV carriers. Although the overall discriminatory power was moderate, these findings raise the possibility of using WG-PRS as a useful criterion for rare variant discovery in patients with IPF and for enhancing decision-making.
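
For context, a polygenic risk score is in its simplest form a weighted sum of risk-allele dosages. The sketch below shows only that additive step; it does not reproduce the C+T or SBayesRC procedures named in the abstract, which determine which variants enter the score and what weights they receive. The variant identifiers, weights, and dosages are hypothetical.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Additive score: sum of dosage x weight over variants present in both inputs.

    dosages: effect-allele counts per variant (0, 1, 2, or imputed fractions).
    weights: per-variant effect sizes, e.g. log odds ratios from a GWAS.
    """
    shared = sorted(dosages.keys() & weights.keys())
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical effect sizes (log odds ratios) and one person's genotype.
example_weights = {"var_a": 0.20, "var_b": -0.10, "var_c": 0.05}
example_dosages = {"var_a": 2, "var_b": 1, "var_c": 0, "var_unscored": 2}

score = polygenic_score(example_dosages, example_weights)
print(f"polygenic score: {score:.2f}")
```

In practice the raw score is usually standardised against a reference population before it is used as a predictor in a regression model.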

13

Clinical evidence yield as a framework for evaluating computational predictors and multiplexed assays of variant effect

Shang, Y.; Badonyi, M.; Marsh, J. A.

2026-03-30 bioinformatics 10.64898/2026.03.27.714777 medRxiv

Top 0.1%

1.8%

Interpreting the clinical significance of missense variants of uncertain significance (VUS) remains a major challenge in clinical genetics. Although computational variant effect predictors (VEPs) and multiplexed assays of variant effect (MAVEs) can generate large-scale functional scores, their value is typically assessed using discrimination metrics such as AUROC rather than by the strength of evidence they provide under ACMG/AMP guidelines. Here, we introduce mean evidence strength (MES), a quantitative metric that summarises the pathogenic and benign evidence assigned across missense variants following gene-level Bayesian calibration. Using the acmgscaler framework, we calibrated 12 population-free VEPs across 367 disease genes and analysed 15 MAVE datasets with sufficient clinical data. MES revealed important discrepancies with AUROC, including cases where methods with similar discrimination differed substantially in evidence yield. MAVEs achieved high average MES despite lower AUROC, while several VEPs showed strong discrimination but more limited calibrated evidence. Among predictors, CPT-1 achieved the highest MES and provided moderate or stronger evidence for the largest fraction of ClinVar VUS. MES therefore provides a practical framework for evaluating computational and experimental variant effect datasets in terms of calibrated clinical evidence yield.

14

PAVS: A Standardized Database of Phenotype-Associated Variants from Saudi Arabian Rare Disease Patients

Abdelhakim, M.; Althagafi, A.; Schofield, P.; Hoehndorf, R.

2026-04-06 genetic and genomic medicine 10.64898/2026.04.05.26350189 medRxiv

Top 0.1%

1.7%

Genotype-phenotype databases are essential for variant interpretation and disease gene discovery. Genetic variation differs among human populations, mainly in allele frequencies and haplotype patterns shaped by ancestry and demographic history. Population-specific genotypes can influence traits and disease risk, which makes population-specific characterization important. Most existing resources focus on the characterization of a population's genetic background, but do not represent the resulting phenotypes. We have developed PAVS (Phenotype-Associated Variants in Saudi Arabia), a curated, publicly accessible database that integrates 5,132 Saudi clinical cases from four Saudi cohorts and 522 cases from analysis of a mixed-population cohort, together with 1,856 cases from the Deciphering Developmental Disorders study (DDD) and 9,588 literature phenopackets. Each case record describes patient-level phenotypes, encoded with the Human Phenotype Ontology (HPO), and links them to genomic variants, gene identifiers, zygosity, pathogenicity classifications, and disease diagnoses mapped to standardized disease terminologies. The data are represented in Phenopackets format and as a knowledge graph in RDF. Additionally, a web interface provides phenotype-based similarity search, gene and variant browsers, and an HPO hierarchy explorer. We evaluate the utility of the phenotype annotations for gene prioritization using semantic similarity. While there are clear differences from global literature-curated databases, phenotypes in PAVS can successfully rank the correct gene highly (ROC AUC: 0.89). PAVS addresses a gap in population-specific genotype-phenotype resources and provides a benchmark for phenotype-driven variant prioritization in under-represented populations.

15

Systematic assessment of machine learning-based variant annotation methods for rare variant association testing

Aguirre, M.; Irudayanathan, F. J.; Crow, M.; Hejase, H. A.; Menon, V. K.; Pendergrass, R. K.; McCarthy, M. I.; Fletez-Brant, K.

2026-03-20 bioinformatics 10.64898/2026.03.18.712715 medRxiv

Top 0.1%

1.7%

Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods -- CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA -- using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8-5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
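
The abstract does not spell out how Wasserstein distances are applied, so the following is only an illustrative sketch of one plausible reading: measuring how far a set of test p-values sits from the Uniform(0, 1) distribution expected under a well-calibrated null. The function name `wasserstein_to_uniform` and the choice of a uniform reference are assumptions, not the authors' method.

```python
def wasserstein_to_uniform(p_values):
    """Exact 1-Wasserstein distance between an empirical sample on [0, 1]
    and the Uniform(0, 1) distribution (hypothetical calibration measure).

    Uses the quantile form W1 = integral over u in [0, 1] of |Q_n(u) - u|,
    where Q_n is the empirical quantile function (a step function).
    """
    xs = sorted(p_values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one p-value")
    total = 0.0
    for i, c in enumerate(xs):
        # On (i/n, (i+1)/n] the empirical quantile is the constant c.
        a, b = i / n, (i + 1) / n
        if c <= a:
            total += (b - a) * ((a + b) / 2 - c)
        elif c >= b:
            total += (b - a) * (c - (a + b) / 2)
        else:
            # c lies inside the interval: integrate both sides of the kink.
            total += ((c - a) ** 2 + (b - c) ** 2) / 2
    return total


# A sample piled up at zero is far from uniform; an evenly spread one is closer.
print(wasserstein_to_uniform([0.0, 0.0]))
print(wasserstein_to_uniform([0.25, 0.75]))
```

Under this reading, a distance near zero would indicate null p-values that track the uniform expectation, while larger values would indicate inflation or deflation.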

16

Evaluating Individual Level Performance of Polygenic Risk Scores Using Early Onset High Genetic Risk Coronary Artery Disease as a Benchmark

Liang, S.; Kim, M. S.; Sui, Y.; Tan, Y.; Li, L.; Cho, S. M.; Koyama, S.; Liu, Y.; Paruchuri, K.; Chan, A.; Honigberg, M.; Natarajan, P.; Chatterjee, N.; Fahed, A. C.; Yu, Z.

2026-04-18 genetic and genomic medicine 10.64898/2026.04.16.26350801 medRxiv

Top 0.2%

1.7%

Polygenic risk scores (PRSs) are typically validated using population-level metrics, masking variability in individual-level risk prediction and hindering clinical translation. To address this, we introduced a novel framework using a "benchmark" cohort (N=1184) of "unexpected coronary artery disease (CAD)": early-onset patients (<55 years) with a clinical profile--low 10-year risk, no diabetes or severe hypercholesterolemia--that excludes therapy indications. The occurrence of early CAD in these clinically low-risk individuals establishes a "ground truth" for high genetic risk. We evaluated 58 published CAD PRSs and demonstrated a disconnection between population-level performance and individual-level accuracy (proportion of benchmark patients captured). The proportion captured by 58 PRSs varied from 10.8% to 33.1%, and the top-performing score was 2-fold more effective at identifying the benchmark group than established non-genetic biomarkers, such as lipoprotein(a). Furthermore, benchmark patients never captured by any score exhibited significantly healthier lipid profiles. Our framework provides an essential method for validating clinical readiness of PRSs.
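
The "proportion of benchmark patients captured" can be illustrated with a minimal sketch. The abstract does not state the rule used to call a patient "captured", so the sketch assumes a patient counts as captured when their score falls in the top fraction of a reference population; the function name `proportion_captured` and the 20% default cutoff are hypothetical.

```python
def proportion_captured(benchmark_scores, reference_scores, top_fraction=0.2):
    """Share of benchmark patients whose PRS reaches the reference cutoff.

    Assumption (not from the abstract): "captured" means scoring at or above
    the lowest score within the top `top_fraction` of the reference population.
    """
    if not benchmark_scores or not reference_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(reference_scores, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    threshold = ranked[k - 1]
    captured = sum(1 for score in benchmark_scores if score >= threshold)
    return captured / len(benchmark_scores)


# Toy data: reference scores 1..100, so the top 20% starts at a score of 81.
reference = [float(x) for x in range(1, 101)]
print(proportion_captured([50.0, 85.0, 90.0, 81.0], reference))
```

Comparing this proportion across scores evaluates each PRS on the same fixed group of patients rather than on a population-level summary statistic.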

17

Long-read sequencing with targeted assembly of the opsin locus accurately evaluates genes in expressed positions

Anderson, Z. B.; Prall, T.; Damaraju, N.; Storz, S. H.; Goffena, J.; Miller, A. L.; Carroll, J.; Neitz, M.; Miller, D. E.

2026-03-19 genetic and genomic medicine 10.64898/2026.03.17.26348636 medRxiv

Top 0.2%

1.7%

The human opsin gene cluster at Xq28 contains highly similar OPN1LW and OPN1MW genes essential for red-green color vision. Current molecular methods cannot accurately analyze this complex locus, limiting diagnosis of color vision deficiencies (CVD) and detection of carrier status. We performed Nanopore long-read sequencing of 206 individuals, comparing alignment-based analysis with targeted de novo assembly. Alignment-based methods performed poorly, whereas targeted assembly achieved 99% concordance for OPN1LW and 92% for OPN1MW copy numbers and resolved gene order in all XY individuals and 87% of XX individuals. This approach detected CVD in 3.2% of XY individuals and identified 8% of XX individuals as carriers, consistent with population estimates. Moreover, it molecularly explained the phenotypic severity in a family with Bornholm eye disease and clarified carrier status in an XX individual suspected of carrying two CVD haplotypes. Our approach provides a comprehensive, reference-free method for accurate analysis of expressed opsin genes and reliable CVD carrier detection.

18

IEKB: a comprehensive knowledge base for inner ear genetics integrating curated associations, cochlear interactions, Bayesian candidate prioritisation, explainable dark-gene support relations, and a scientific entity network

Wang, H.; Chen, W.; Ning, H.; Cai, Y.; Xu, Y.; Hou, X.; Pang, L.; Luo, Z.; Tian, C.

2026-04-09 bioinformatics 10.64898/2026.04.06.716823 medRxiv

Top 0.2%

1.5%

Inner-ear genetics has expanded rapidly, yet the supporting evidence remains dispersed across a vast literature and across resources that typically emphasise loci, variants, or expression data rather than integrated biological interpretation. Here we present the Inner Ear Knowledge Base (IEKB; https://earkb.org), an open database that unifies curated associations, cochlear interaction evidence, candidate prioritisation, explainable support relations, and network exploration for inner-ear research. IEKB was built with an automated agent-assisted curation workflow that combines schema-constrained literature extraction, continuous human monitoring, and final expert review by inner-ear genetics researchers. By systematically analysing 250,696 PubMed-indexed records retrieved across 16,563 screened genes, IEKB curates 6,051 gene-phenotype-disease associations from 2,494 genes across 43 phenotype categories and 4,102 cochlear gene-gene interactions with pathway, cell-type, and experimental context. IEKB further includes a Bayesian "dark matter" module that prioritises 243,071 candidate gene-phenotype associations for 13,229 genes across all 43 phenotypes (global AUC-ROC = 0.8603; global AUC-PR = 0.1674), together with a supervised dark-relation layer that ranks phenotype-specific known-gene support for each candidate and a multi-entity scientific network containing nearly 4,000 entities, 28,616 deterministic edges, and 83,712 literature-derived relational links. The web resource supports interactive search, multi-parameter filtering, gene-detail pages, bibliometric exploration, domain-specific enrichment against IEKB phenotype and disease gene sets, network visualisation, bulk download in CSV, JSON, SQLite, and XLSX formats, and natural-language evidence-grounded question answering through a companion conversational interface (IEKB QA). 
To our knowledge, IEKB is the first openly accessible inner-ear resource that integrates curated associations, cochlear interactions, probabilistic candidate prioritisation, auditable known-gene support relations for novel candidates, and a multi-entity scientific network within a single database. All data are released without registration under the CC BY 4.0 license.

19

The Power of Partnership: Democratizing Genetic Prevalence to Empower Patient Advocacy

Baxter, S. M.; Singer-Berk, M.; Glaze, C.; Russell, K.; Grant, R. H.; Groopman, E.; Lee, J.; Watts, N.; Wood, J. C.; Wilson, M.; Rare As One Network, ; Rehm, H. L.; O'Donnell-Luria, A.

2026-03-31 genetic and genomic medicine 10.64898/2026.03.30.26349539 medRxiv

Top 0.2%

1.4%

Introduction: Accurate estimation of disease prevalence is crucial for public health and therapeutic development, but traditional methods are often inaccurate. Genetic prevalence, which uses allele frequencies from population data to estimate the proportion of a population with a causal genotype, offers an important alternative. Methods: We partnered with 18 Rare As One patient organizations to estimate genetic prevalence for 22 autosomal recessive conditions using population data from two releases of the Genome Aggregation Database (gnomAD). To standardize and democratize these analyses, we developed the Genetic Prevalence Estimator (GeniE), a publicly available tool that makes these calculations accessible. Results: Conservative carrier frequencies in gnomAD v4.1 ranged from 1/164 to 1/11,888. The median change in genetic prevalence frequency between v2.1 and v4.1 was 0.806. Partnership with patient advocacy groups provided critical real-world context that refined the interpretation of these estimates. Discussion: These findings highlight that genetic prevalence is not a static figure but a dynamic, evolving measure with important caveats that need to be considered. Our study underscores the necessity of re-evaluations as databases expand. By integrating patient-partnered insights with the GeniE platform, we empower the genomics community to maintain transparent, up-to-date, and actionable data for rare disease advocacy and drug development.
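
For an autosomal recessive condition, the arithmetic behind a genetic prevalence estimate can be sketched under Hardy-Weinberg assumptions. This is a textbook approximation for illustration and is not claimed to be how GeniE computes its estimates; the function name `hardy_weinberg_estimates` and the example allele frequencies are hypothetical.

```python
def hardy_weinberg_estimates(pathogenic_allele_freqs):
    """Return (q, carrier_frequency, genetic_prevalence) for a recessive gene.

    Assumptions (not from the abstract): Hardy-Weinberg equilibrium, full
    penetrance, and each listed variant sitting on a distinct haplotype, so
    the combined pathogenic allele frequency q is the sum of the inputs.
    """
    q = sum(pathogenic_allele_freqs)
    if not 0.0 <= q <= 1.0:
        raise ValueError("combined allele frequency must lie in [0, 1]")
    carrier_frequency = 2 * q * (1 - q)  # heterozygotes: 2pq
    genetic_prevalence = q * q           # biallelic individuals: q^2
    return q, carrier_frequency, genetic_prevalence


# Two hypothetical pathogenic variants with frequencies 0.001 and 0.002.
q, carriers, affected = hardy_weinberg_estimates([0.001, 0.002])
print(f"q = {q:.4f}, carriers = 1/{1 / carriers:.0f}, affected = 1/{1 / affected:.0f}")
```

Because the estimate depends directly on which variants are included and on their observed frequencies, a new database release with more samples or reclassified variants can shift the result, consistent with the abstract's point that genetic prevalence is not a static figure.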

20

Genotype-Based Severity Scoring System in Wolfram Syndrome

Oiknine, L.; Tang, A. F.; Urano, F.

2026-03-26 genetic and genomic medicine 10.64898/2026.03.24.26349216 medRxiv

Top 0.2%

1.3%

Wolfram syndrome is a rare genetic disorder characterized by antibody-negative early-onset atypical diabetes mellitus, optic nerve atrophy, sensorineural hearing loss, diabetes insipidus (arginine vasopressin deficiency), and progressive neurodegeneration, with significant variability in disease severity. We assessed the accuracy of a genotype-based severity scoring system to predict the onset of cardinal symptoms in Wolfram syndrome. This system is based on the type of WFS1 variants (in-frame or out-of-frame) and their location relative to transmembrane domains. Severity scores were assigned to 324 patients with documented onset ages for diabetes mellitus, optic atrophy, hearing loss, and diabetes insipidus. Our analysis revealed a clear correlation between severity scores and earlier onset of diabetes mellitus and optic atrophy. Patients with in-frame variants outside transmembrane domains exhibited milder symptoms, especially those with the WFS1 c.1672C>T (p.Arg558Cys) variant, whereas those with out-of-frame variants showed the earliest onset. Severity scores 3 and 4 did not follow the expected progression, suggesting that transmembrane domain involvement in both alleles may result in greater severity. These findings suggest that this scoring system provides valuable insights into the progression of Wolfram syndrome and may guide clinical care. Further refinement may improve its utility for predicting the onset of non-diabetic symptoms.